Compelling arguments suggest the presence of new physics at energy scales that will be probed
by frontier energy colliders over the next decade. Arguments for each of the many flavors of new 
physics that have been proposed seem much less compelling. The wide variety of experimental 
signatures by which new physics may manifest itself suggests the desirability of analyzing all high 
energy collider data in one systematic framework. These proceedings describe two potentially useful 
pieces of such a framework: Sleuth enables a model-independent search for new high-pT physics, 
and Quaero automates tests of particular hypotheses against high energy collider data. A sampling
of algorithmic detail is provided in the form of a procedure for choosing an optimal binning when 
computing likelihood ratios. 
I. CONTEXT

The audience for this talk (and these proceedings) 
comprises astrophysicists, cosmologists, and statisticians, 
in addition to high energy experimentalists. It is there- 
fore worth beginning by discussing the nature of high en- 
ergy collider data, particularly those features that make 
these data amenable to the algorithms described here. 
These data are collected by large, complex detectors that 
record on roughly a million channels the debris from the 
collisions of particles (protons, electrons, and their an- 
timatter counterparts) traveling within a few hundred 
miles per hour of the speed of light. 

The information contained in these million channels of 
electronics is reduced through a series of steps to roughly 
one dozen numbers, corresponding to the energies and di- 
rections (polar and azimuthal angles) of the elementary 
objects emerging from the collision. This severe reduction
in detail facilitates a direct connection to the
underlying theory.

FIG. 1: A Feynman diagram, showing the annihilation of a
quark ($q$) and antiquark ($\bar{q}$), and the subsequent production
and decay of a top quark ($t$) and an antitop quark ($\bar{t}$). Time
increases to the right.

The underlying theory is most easily
understood graphically in terms of Feynman diagrams,
an example of which is shown in Fig. 1. Our detectors
and algorithms (imperfectly) reconstruct the outgoing
particles in collisions like that depicted in Fig. 1. The
goal is to figure out, from the debris of trillions of particle
collisions, the rules corresponding to graphs such as
that shown in Fig. 1: rules for what types of graphs can
be drawn, and rules for calculating observable quantities
from them. In doing so, we infer from measurements on 
scales of meters the laws of Nature on scales of 10^^^ 
meters and below. 

The theoretical context in which we work is grounded 
in the standard model of particle physics, which predicts 
the results of nearly all experiments performed to date 
with extraordinary accuracy — and in many cases also 
with extraordinary precision. This standard model rep- 
resents a canonical reference model, the null hypothesis 
in our field. 

The theoretical landscape beyond the standard model 
is much less clear. Hundreds of different scenarios have 
been proposed, each containing many parameters. The 
lack of clarity in this picture is nicely captured in a slide 
shown during the summary talk of Lepton Photon 2003,
reproduced in Fig. 2.

FIG. 2: The theoretical landscape, as depicted in the summary
talk of this year's Lepton Photon conference.



II. SLEUTH 

The jumbled theoretical landscape in Fig. 2, reflecting
the plethora of possible extensions to the standard model, 
calls into question the paradigm currently being used to 
explore that landscape. At present roughly one graduate 
student is consumed for each model tested. 

An alternative way to proceed is to systematically 
search for any evidence of new physics that lies in the 
data, in a manner that is as model-independent as possi- 
ble. A prescription for doing this is an algorithm called 
Sleuth, used by the DØ experiment in Run I to search
a large subset of their data.

One of many problems faced when searching for new 
physics in such a directionless landscape is how to take 
into account the large space of possible signatures that 
could appear when computing a final measure of the sig- 
nificance of any particular result. If many students look 
at many plots over an extended period of time, fluctua- 
tions at the level of three or more standard deviations 
are bound to appear simply from the fact that thou- 
sands of bins in various histograms have been consid- 
ered. The difficulty in computing this trials factor, the 
number of possible places that an interesting signal could 
have appeared, has hamstrung several previous search ef- 
forts that have attempted to base themselves on signa- 
tures rather than models. A rigorous accounting of the 
trials factor is crucial to any model-independent search; 
Sleuth is one of the few algorithms currently on the 
market that is able to compute this trials factor rigorously
and explicitly. The H1 Collaboration has developed
an algorithm in similar spirit for HERA physics.
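
The size of the effect described above can be illustrated with a short calculation. The sketch below assumes that the bins are statistically independent and that each is compared against a one-sided Gaussian threshold; both the independence assumption and the function name `prob_any_fluctuation` are illustrative and are not part of Sleuth itself.

```python
from statistics import NormalDist


def prob_any_fluctuation(n_bins: int, n_sigma: float = 3.0) -> float:
    """Chance that at least one of n_bins independent bins fluctuates
    upward by n_sigma or more when only background is present."""
    p_single = 1.0 - NormalDist().cdf(n_sigma)  # one-sided tail probability
    return 1.0 - (1.0 - p_single) ** n_bins


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(n, round(prob_any_fluctuation(n), 3))
```

Under these assumptions a single bin shows a three standard deviation excess about once in 700 tries, while a search over a thousand bins shows at least one such excess more often than not.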

Key to a rigorous computation of the trials factor is 
defining — before the data is collected — the interest- 
ingness of any particular signature that might be seen in 
those data. Sleuth is able to do this by making three
well-justified assumptions.

1. The data can be categorized into exclusive final 
states in such a way that any signature of new 
physics is apt to appear predominantly in one of 
these final states. 

2. New physics will appear with objects at high transverse
momentum (pT) relative to standard model
and instrumental background.

3. New physics will appear as an excess of data over 
background. 

The Sleuth algorithm consists of three steps, following 
these three assumptions. 

In the first step, all of the collisions are partitioned 
into exclusive final states. The objects used to categorize 
these final states are high-pT and isolated electrons ($e$),
muons ($\mu$), taus ($\tau$), photons ($\gamma$), jets ($j$), b-tagged jets
($b$), and missing transverse energy ($\not\!\!E_T$).
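
The partitioning step can be sketched as follows. This is a minimal illustration, assuming each event has already been reduced to a list of object labels; the label spellings, the ordering convention, and the function names `final_state` and `partition` are hypothetical and chosen only for this example.

```python
from collections import Counter

# Object labels follow the text; the ordering convention is an assumption.
OBJECT_ORDER = ["e", "mu", "tau", "gamma", "j", "b", "MET"]


def final_state(objects):
    """Return an exclusive final-state label, e.g. 'e mu 2j MET'."""
    counts = Counter(objects)
    unknown = set(counts) - set(OBJECT_ORDER)
    if unknown:
        raise ValueError(f"unknown object types: {sorted(unknown)}")
    parts = []
    for name in OBJECT_ORDER:
        n = counts.get(name, 0)
        if n == 1:
            parts.append(name)
        elif n > 1:
            parts.append(f"{n}{name}")
    return " ".join(parts)


def partition(events):
    """Group events (lists of object labels) by final state."""
    groups = {}
    for event in events:
        groups.setdefault(final_state(event), []).append(event)
    return groups
```

Because the label depends only on the object content of the event, every event receives exactly one label, so the resulting final states are exclusive by construction.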

The second step of the algorithm defines a low- 
dimensional variable space for each final state. In the 
Run I implementation of Sleuth, the variables used were 

• the summed transverse momentum of any leptons
in the event ($\sum p_T^{\ell}$);

• the missing transverse energy ($\not\!\!E_T$), if significant in
the event;

• the summed transverse momentum of any electroweak
gauge bosons in the event ($\sum p_T^{\gamma/W/Z}$);
and

• the summed transverse momentum of any jets in
the event ($\sum p_T^{j}$).

The Run II algorithm is simplified enormously by con- 
sidering only a single variable, 

• the summed transverse momentum of all objects in
the event ($\sum p_T$).

New high-pT physics is best searched for by systematically
looking for new physics at high pT.

The algorithm's third step involves searching for re- 
gions in which more events are seen in the data than 
expected from standard model and instrumental back- 
ground. This search is performed in the variable space 
defined in the second step of the algorithm, for each of 
the exclusive final states defined in the first step. 

The details of the search are somewhat involved, but 
both the input and output are exceptionally simple. For 
each final state, the input is simply the events seen in 
the data, and the expected background. The steps of the 
search can be sketched as follows. 

• The variable space is transformed into the unit box:
the unit interval in one dimension, unit square
in two dimensions, unit cube in three dimensions,
and unit hypercube in four dimensions.

• The notion of regions about sets of data points is
rigorously defined using the concept of Voronoi diagrams,
borrowed from the field of computational
geometry. Figure 3 shows an example of a unit
square containing seven data points, shown as black
dots. The perpendicular bisectors of line segments
connecting each pair of data points connect to form
the Voronoi diagram.

• The interestingness of any particular region (or
union of such regions) in Fig. 3 is the Poisson probability
that the background in that region would
fluctuate up to or above the observed number of
events in that region.

• The most interesting region $\mathcal{R}$ is found using a
search heuristic to explore the space of potentially
interesting regions.

• Pseudo experiments are performed to determine
the fraction $\mathcal{P}$ of hypothetical similar experiments
in which something more interesting than $\mathcal{R}$ would
be seen. Here the fact that many different places
have been considered is rigorously and explicitly accounted
for. Sleuth and its H1 analogue appear to
be the only algorithms currently on the market for
frontier energy collider physics that compute this
trials factor completely and systematically.

• The results from all final states considered are then
combined to form $\tilde{\mathcal{P}}$, which quantifies the interestingness
of the most interesting region observed in
the data, accounting for the fact that many final
states have been considered.

FIG. 3: A Voronoi diagram with seven data points (black
dots) in a unit square (left). The Run I Sleuth algorithm
considers regions that are unions of these cells, such as the
shaded region (right).

The Run II algorithm is trivial by comparison. In
the single variable $\sum p_T$, semi-infinite regions are defined
with a lower bound at each data point. The definition
of interestingness, running of pseudo experiments, and
calculation of $\mathcal{P}$ and $\tilde{\mathcal{P}}$ proceed as above.
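
The single-variable search can be sketched in a few lines. This is an illustration under stated assumptions rather than the Sleuth implementation: the background is taken to be a falling exponential in the summed transverse momentum with a known total yield (`total_bkg`) and scale (`scale`), both hypothetical parameters, and all function names are invented for the example.

```python
import math
import random


def poisson_tail(n_obs, mean):
    """P(N >= n_obs) for a Poisson distribution with the given mean."""
    if n_obs <= 0:
        return 1.0
    term = math.exp(-mean)
    cdf = term
    for k in range(1, n_obs):
        term *= mean / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def most_interesting_region(sum_pts, bkg_above):
    """Scan semi-infinite regions with a lower bound at each data point;
    return the smallest Poisson probability and the bound that gives it."""
    best_p, best_cut = 1.0, None
    data = sorted(sum_pts)
    for i, cut in enumerate(data):
        n_obs = len(data) - i  # events at or above this lower bound
        p = poisson_tail(n_obs, bkg_above(cut))
        if p < best_p:
            best_p, best_cut = p, cut
    return best_p, best_cut


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sleuth_script_p(data, total_bkg, scale, n_pseudo=2000, seed=1):
    """Fraction of background-only pseudo experiments whose most
    interesting region is at least as interesting as that in the data."""
    rng = random.Random(seed)

    def bkg_above(cut):
        # Assumed background shape: exponential in the summed pT.
        return total_bkg * math.exp(-cut / scale)

    p_data, cut = most_interesting_region(data, bkg_above)
    n_more = 0
    for _ in range(n_pseudo):
        n_events = sample_poisson(total_bkg, rng)
        pseudo = [rng.expovariate(1.0 / scale) for _ in range(n_events)]
        if most_interesting_region(pseudo, bkg_above)[0] <= p_data:
            n_more += 1
    return n_more / n_pseudo, cut
```

Because each pseudo experiment is scanned with the same procedure as the data, the returned fraction accounts for the many lower bounds that were tried.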



The output of the algorithm is the most interesting
region $\mathcal{R}$ observed in the data, and a number $\tilde{\mathcal{P}}$ that
quantifies the interestingness of $\mathcal{R}$. $\tilde{\mathcal{P}}$ is a number between
zero and unity, pulled from a uniform distribution
on the unit interval if the data come from background
alone, and expected to be small if the data contain a hint
of new physics. A reasonable threshold for discovery is
$\tilde{\mathcal{P}} < 0.001$, which corresponds loosely to the de facto $5\sigma$
standard in our field after the trials factor is accounted
for.
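
Probabilities of this kind are often quoted in units of standard deviations. A minimal sketch of the conversion follows, assuming the one-sided Gaussian convention; the function name `to_sigma` is illustrative.

```python
from statistics import NormalDist


def to_sigma(prob: float) -> float:
    """One-sided conversion: the number of standard deviations z such
    that a unit Gaussian exceeds z with probability prob."""
    return NormalDist().inv_cdf(1.0 - prob)


if __name__ == "__main__":
    for prob in (0.72, 0.5, 0.001):
        print(prob, round(to_sigma(prob), 2))
```

With this convention a probability above one half maps to a negative number of standard deviations, and a probability of 0.001 maps to roughly three standard deviations.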

Two questions must now be asked: 

• Will Sleuth find nothing if there is nothing to be 
found? 

• Will Sleuth find something if there is something
to be found?

The answer to the first is "yes," by construction.
The answer to the second depends on the extent to which the
new physics waiting to be uncovered satisfies the three
assumptions on which Sleuth is based.
TABLE I: Summary of a Sleuth sensitivity study on the
$e\mu\not\!\!E_T$, $e\mu\not\!\!E_T j$, $e\mu\not\!\!E_T jj$, and $e\mu\not\!\!E_T jjj$ final states. When the
standard model processes $WW$ and $t\bar{t}$ are omitted from the
background estimate (second column), Sleuth identifies a
region of excess in the $e\mu\not\!\!E_T$ and $e\mu\not\!\!E_T jj$ final states (with
$\mathcal{P} = 2.4\sigma$ and $2.3\sigma$, respectively), presumably indicating the
true presence of $WW$ and $t\bar{t}$ in the data. When the standard
model process $WW$ is included and $t\bar{t}$ omitted (third column),
Sleuth identifies a region of excess in the $e\mu\not\!\!E_T jj$ final state
(with $\mathcal{P} = 1.9\sigma$), presumably indicating the true presence of
$t\bar{t}$ in the data. With all standard model processes included to
search for new physics (fourth column), Sleuth indicates that
72% ($\tilde{\mathcal{P}} = -0.6\sigma$) of background-only hypothetical similar
experiments would have produced a region more interesting
than the most interesting region observed in these data.

Although no general answer can be given to this second
question, an answer can be given for any specific
case. Such a specific case is summarized in Table I.
Events containing an energetic electron, muon, and possibly
other objects ($e\mu X$) are considered. In a first pass,
standard model $WW$ and $t\bar{t}$ production are omitted from
the background estimate to see if Sleuth is able to find
evidence of these processes in DØ Run I data, and the result
$\mathcal{P}$ obtained in each final state (translated into units
of standard deviations) is shown. Sleuth finds $\mathcal{P} = 2.4\sigma$
and $2.3\sigma$ in the final states $e\mu\not\!\!E_T$ and $e\mu\not\!\!E_T jj$; these excesses
correspond (presumably) to the true presence of
$WW$ and $t\bar{t}$ in these data. For comparison, a dedicated
search for $WW$ in Run I at CDF [7] resulted in 5 events
observed on a background of 1.2 ± 0.3, corresponding to
a significance of $2.3\sigma$; and a dedicated search in $e\mu X$ for
$t\bar{t}$ by DØ in Run I [8] resulted in 5 events observed on a
background of 1.4 ± 0.4, corresponding to a significance of
$2.1\sigma$.
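
The dedicated-search significances quoted above can be approximated with a simple counting calculation. The sketch below neglects the uncertainty on the background, which is an assumption made for brevity; it therefore gives values slightly larger than those quoted, which do include that uncertainty. The function names are illustrative.

```python
import math
from statistics import NormalDist


def poisson_p_value(n_obs: int, bkg: float) -> float:
    """P(N >= n_obs | bkg), with the background uncertainty neglected."""
    below = sum(math.exp(-bkg) * bkg**k / math.factorial(k) for k in range(n_obs))
    return 1.0 - below


def significance(n_obs: int, bkg: float) -> float:
    """One-sided Gaussian equivalent of the Poisson p-value."""
    return NormalDist().inv_cdf(1.0 - poisson_p_value(n_obs, bkg))


if __name__ == "__main__":
    print(round(significance(5, 1.2), 2))  # 5 events on a background of 1.2
    print(round(significance(5, 1.4), 2))  # 5 events on a background of 1.4
```

Folding in the background uncertainty would require averaging the p-value over the assumed distribution of the background mean, which is omitted here.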

The quantity V obtained from Sleuth really should 
not be directly compared to the result of a dedicated 
search, since the two techniques are intended for very 
different problems: dedicated searches are clearly pre- 
ferred if there are well-defined, compelling things to be 
found, while Sleuth provides an alternative strategy in 
their absence. This example nonetheless provides useful 
intuition for Sleuth's performance on a difficult test. 

In a second pass, standard model WW production 
is included in the background estimate, with standard
model $t\bar{t}$ production still omitted, to see whether Sleuth
could find evidence of $t\bar{t}$ in these data. The results obtained
are shown in the third column of Table I, with
the excess in the $e\mu\not\!\!E_T jj$ final state corresponding (presumably)
to the actual presence of $t\bar{t}$ in these data. The
slight indications of excess in these examples clearly fall 
well short of that needed to make a discovery claim; as 
indicated above, these are difficult tests. 

With all backgrounds included and Sleuth used to
search for new physics in the fourth column of Table I,
a null result is obtained. The use of Sleuth to analyze
roughly thirty additional final states at DØ in Tevatron
Run I resulted in no evidence of new physics.

A general model-independent search in similar spirit 
has recently been presented by the H1 collaboration at
the 2003 European Physical Society meeting in Aachen,
Germany. It will be interesting to continue to watch their
$\mu j \nu$ final state in HERA Run II.



III. QUAERO 

The first hint of new physics at Tevatron Run II may 
come from a model-independent search. Once such a hint 
is found, it must be interpreted in terms of an underlying 
physical theory. This interpretation would clearly be fa- 
cilitated by some means of quickly and efficiently testing 
the predictions of many different hypotheses against the 
data. Quaero (Latin for "I search for," or "I seek") is
an algorithm designed for this purpose. 

Present practice for testing hypotheses against collider 
data can be improved upon in several respects. A per- 
sonal wish list for conducting analyses on high energy 
collider data includes: 

• Reducing the time spent to perform an analysis 
from two years of one graduate student's life to 
roughly an hour of CPU time. Achieving this would 
represent a reduction in the time it takes to perform 
an analysis by a factor of $10^4$.

• Reducing human bias that invariably creeps into 
analyses on complex data sets. 

• Allowing the publication of data in their full dimen- 
sionality, unrestricted by the two dimensions of a 
sheet of paper. 

• Providing an alternative to exclusion contours. The 
exclusion plots often shown make it difficult to un- 
derstand exactly what model is being tested, to- 
gether with all assumptions that are made, and 
difficult to tell what the data have to say about 
a model that does not lie in that two-dimensional 
space. 

• Automating the optimization of analyses, to ensure 
the data are used to their fullest. 

• Propagating systematic errors in an intuitive,
straightforward, and rigorous way.

• Combining results among correlated experiments in 
a manner that requires as few ad hoc prescriptions 
as possible. 

• Increasing the robustness of our scientific results by 
using a high-level analysis algorithm that has been 
validated on hundreds of previous analyses. 

• All of this on the web. 

A first pass of such an algorithm has been achieved. 
With the posting of an article entitled "Search for 
New Physics Using Quaero: A General Interface to 
DØ Event Data," DØ has made a subset of data
collected in Tevatron Run I available on the web at 
http://quaero.fnal.gov/ since June 2001. 

Astrophysicists have become accustomed to polished 
interfaces to their data; the web page served up by the 
Sloan Digital Sky Survey at http://www.sdss.org/ is 
one of many examples. Those in the audience with this 
image in mind are bound to be disappointed by the look 
and feel of Fig. 4; high energy physics is at least a
decade behind the astrophysics and astronomy communities
on this front.

The essential Quaero interface, devoid of adornment, 
is displayed in Fig. 4. A physicist with a particular
hypothesis $\mathcal{H}$ to test against high energy collider data
should be able to provide his hypothesis in the form of the 
events his model predicts — either as input to an event 
generator, or as a file with the events themselves. These 
events (the "signal"), together with whatever standard 
model processes ("backgrounds") he wishes to include, 
define his hypothesis for how Nature works at very small 
distance scales. 

After providing his name and the email address to
which the result should be sent, the physicist clicks "Submit."
Quaero then performs the complete analysis, taking
into full account the expert knowledge gleaned within
the collaboration and packaged into code, and returns a
single number, quantifying the extent to which the data
(dis)favor the hypothesis relative to the standard model.
Quaero also provides a number of plots showing in detail
how the analysis was performed. Far from being
a black box, Quaero arguably provides a much more
transparent view into how analyses are performed than
our standard publications.

FIG. 4: The Quaero web page under development for frontier
energy collider data.

The Quaero algorithm itself is relatively simple, in- 
volving a few straightforward steps. 

• The events predicted by the hypothesis $\mathcal{H}$ are run
through the detector simulation appropriate for
each experiment.

• Events from $\mathcal{H}$, the standard model (SM), and the
data ($\mathcal{D}$) are partitioned into exclusive final states.
Speaking loosely, these final states are orthogonal
(no event belongs to more than one final state) and
complete (every event belongs to a final state).

• Variables are chosen automatically within each final
state.

• A binning is chosen automatically within the variable
space in each final state.

• A binned likelihood is calculated within each final
state.

• Results from different final states are combined.

• Results from different experiments are combined.

• Systematic errors are integrated numerically.

• The result returned is a likelihood ratio,

$$ \mathcal{L}(\mathcal{H}) = \frac{p(\mathcal{D}|\mathcal{H})}{p(\mathcal{D}|\mathrm{SM})}. \tag{1} $$
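
The binned likelihood step in the list above can be sketched as follows. This is a simplified illustration, not the Quaero code: it assumes independent Poisson statistics in each bin, exactly known predictions, and a base-10 logarithm; systematic errors and the combination across final states and experiments are omitted. The function names are hypothetical.

```python
import math


def log_poisson(d: int, mean: float) -> float:
    """log10 of the Poisson probability of observing d events.
    The mean must be strictly positive."""
    return (d * math.log(mean) - mean - math.lgamma(d + 1)) / math.log(10)


def log_likelihood_ratio(data, pred_h, pred_sm):
    """log10 [ p(D|H) / p(D|SM) ], summed over bins."""
    return sum(
        log_poisson(d, h) - log_poisson(d, sm)
        for d, h, sm in zip(data, pred_h, pred_sm)
    )


if __name__ == "__main__":
    data = [3, 10]
    print(log_likelihood_ratio(data, pred_h=[3.0, 10.0], pred_sm=[1.0, 10.0]))
```

A positive value indicates that the data favor the hypothesis over the standard model; bins in which the two predictions agree contribute nothing.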



In order to provide a feeling for the details of the al- 
gorithm within the space constraints of these proceed- 
ings, one piece of the algorithm is highlighted: automatic 
choice of binning. 



IV. OPTIMAL BINNING FOR LIKELIHOOD 
RATIOS 

A binned likelihood provides a robust yet sensitive 
method for discriminating between two hypotheses. But 
how should the bins be chosen? Somewhat surprisingly, 
the literature does not yet appear to contain a satisfac- 
tory general prescription for choosing an optimal binning. 
This section suggests such a prescription, investigates its 
implications in several limiting cases, and provides ex- 
amples of its use. 

Figures 5(a) and (b) show a typical problem. Predicted
(analytic) distributions from two hypotheses h and b are
shown in Fig. 5(a). Units on the vertical axis are the
number of predicted events per unit of x, the observable
shown on the horizontal axis. Often the analytic forms of
the predictions are not known, however; knowledge of the
predictions from h and b comes in the form of an ensemble
of Monte Carlo events, whose statistics are limited by the
complexity of the simulation required for each event.

It is desired to perform an experiment to collect data
d to determine whether hypothesis h or b is the more accurate
description of Nature. The number we would like
to determine is the likelihood ratio $p(d|h)/p(d|b)$: in
words, the probability of obtaining the data d assuming
the correctness of the hypothesis h, divided by the probability
of obtaining the data d assuming the correctness
of the hypothesis b.

Given the predictions from h and b shown in Fig. 5(b),
how should this likelihood ratio be computed? If the predictions
h(x) and b(x) were known as analytic functions
of x, as in Fig. 5(a), an unbinned likelihood could be
calculated. But the analytic forms h(x) and b(x) are not
known. Constructing smooth distributions h(x) and b(x)
from Monte Carlo points using smoothing techniques is
possible, but the final answer is often unfortunately sensitive
to the details of how this smoothing is performed.
The only reasonable option appears to involve the introduction
of bins, and the computation of a binned likelihood.
FIG. 5: Placement of bins in a toy example: a bump on a
falling exponential. The true (unknown, analytic) distributions
h(x) (solid, green) and b(x) (dashed, red) are shown in
(a); our knowledge of these distributions, in the form of 1000
Monte Carlo points drawn from each, is shown in (b). In this
case b(x) is a simple exponential, with h(x) adding a Gaussian
bump centered at x = 20. The vertical axes in (a) and
(b) represent the number of events expected per unit of x.
Sequential placement of bin edges is shown in (c)-(f), with
the figure of merit $\mathcal{M}$ on the vertical axis.

FIG. 6: Placement of bins in a toy example: two Gaussians of
different widths. The true (unknown, analytic) distributions
h(x) (solid, green) and b(x) (dashed, red) are shown in (a); our
knowledge of these distributions, in the form of 1000 Monte
Carlo points drawn from each, is shown in (b). In this case
b(x) is a Gaussian centered at 25 with width 8; h(x) is a
Gaussian with the same mean and width 5. The vertical axes
in (a) and (b) represent the number of events expected per
unit of x. Sequential placement of bin edges is shown in (c)-(f),
with the figure of merit $\mathcal{M}$ on the vertical axis.



But then how should the bins be set? The bins must
clearly be fine enough to probe the difference in shape between
the two distributions; the bins must just as clearly
be large enough that an accurate prediction is obtained
for the number of events $h_k$ and $b_k$ in each bin k. The
issue at hand is not only how many bins to use, but also
where to place their edges.

There is no unique solution to this problem. The best 
one can do is to define a reasonable prescription for choos- 
ing an optimal binning — in effect, by suggesting some 
reasonable definition of "optimal" — and demonstrate 
its reasonable behavior on a variety of examples. 

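The prescription suggested here involves defining a figure
of merit by

$$ \mathcal{M} = \sum_{d_1=0}^{\infty} \sum_{d_2=0}^{\infty} \cdots \left( \prod_k p(d_k \,|\, p(h_k)) \right) \log \left( \frac{\prod_k p(d_k \,|\, p(h_k))}{\prod_k p(d_k \,|\, p(b_k))} \right) + (h \leftrightarrow b) - \mathcal{P}. \tag{2} $$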

In words, $\mathcal{M}$ is the evidence the experiment is expected
to provide in favor of h if h is correct, plus the evidence
the experiment is expected to provide in favor of b if b
is correct. The definition of "evidence" here, adopted
from the literature, is the logarithm of the likelihood ratio;
"expected" is defined in terms of an average over an ensemble
of hypothetical experiments, where the correctness
of either h or b is assumed in weighting the possible
outcomes.

The initial sum in Eq. (2) is over all possible outcomes of
the experiment: the number of data events $d_k$ in each bin
k is allowed to vary between zero and infinity. The factor
$\prod_k p(d_k | p(h_k))$ weights each outcome by the probability
of its occurrence, assuming the correctness of h. Here
$p(h_k)$ is our knowledge of the number of events predicted
by h in bin k; we might have $p(h_k)$ in the form of a
Gaussian with mean 7 and width 1.2 if the number of
events predicted by h in bin k were 7 ± 1.2. The factor
$\log \left( \prod_k p(d_k | p(h_k)) / \prod_k p(d_k | p(b_k)) \right)$ is the evidence obtained in favor of
h in this outcome. To this is added a similar term with h
and b swapped; the second term $(h \leftrightarrow b)$ is the expected
evidence in favor of b if b is correct. The third term $\mathcal{P}$
is a penalty term, which provides the stopping condition
for the algorithm's placement of bins.
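
A limiting case makes the structure of the figure of merit more concrete. Assume, for illustration only, that the predictions in each bin are known exactly (so that the knowledge of $h_k$ and $b_k$ has no width), that the bins are independent, that the counts are Poisson distributed, and that the logarithm is natural. The expected evidence in favor of h from a single bin k is then

$$ \sum_{d=0}^{\infty} \frac{e^{-h_k} h_k^{d}}{d!} \, \log \frac{e^{-h_k} h_k^{d}}{e^{-b_k} b_k^{d}} = h_k \log \frac{h_k}{b_k} - (h_k - b_k), $$

where the right-hand side follows because the mean of d is $h_k$. Adding the corresponding term with h and b swapped and summing over bins gives

$$ \mathcal{M} = \sum_k (h_k - b_k) \log \frac{h_k}{b_k} - \mathcal{P}. $$

Each term in the sum is non-negative and vanishes only when $h_k = b_k$, so in this limit a bin in which the two hypotheses predict the same number of events contributes no expected evidence, and the figure of merit for such a binning is set by the penalty term alone.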

Figures 5 and 6 show how this figure of merit $\mathcal{M}$ can
be used to determine the placement of bins. Figure 5(a)
shows the true (analytic and unknown) predictions h(x)
and b(x) from the hypotheses h and b in the observable
x. Figure 5(b) shows our knowledge of the predictions
of these two hypotheses, in the form of one thousand
Monte Carlo points drawn from the true (unknown) distributions
h(x) and b(x). Using a single bin from 0 to
50 in x, the figure of merit computed using Eq. (2) for the
points shown in Fig. 5(b) is $\mathcal{M} = 0.05$.

Figures 5(c)-(f) show the successive placement of bin
edges. In these plots the vertical axis has units of expected
evidence; the predictions of h and b are superimposed
with arbitrary scale. At each value of x, the
dark (jagged, blue) curve shows the figure of merit $\mathcal{M}$ if
a bin edge is placed at that point. In Fig. 5(c), the maximum
of this curve is obtained with a bin edge placed at
x = 18; this raises the figure of merit to $\mathcal{M} = 0.60$. Placing
a bin edge at this point results in one bin stretching
from 0 to 18, and one bin reaching from 18 to 50. Figure 5(d)
shows this process repeated, the figure of merit
calculated for each possible location of a second bin edge.
Maximizing the expected evidence in the dark (jagged,
blue) curve requires placement of a bin edge at x = 23.
This placement leaves three bins: [0-18], [18-23], and
[23-50]. Figures 5(e) and (f) show the placement of two
more bin edges, at x = 30 and at x = 25. Further placement
of bin edges decreases the figure of merit $\mathcal{M}$, so the
algorithm halts.
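
The sequential placement of bin edges described above can be sketched as a greedy search. The code below is an illustration under several assumptions that are not taken from the text: the per-bin predictions are treated as exactly known Poisson means equal to the Monte Carlo counts, a pseudo-count (`pseudo`) keeps empty bins finite, the penalty is a constant per bin (`penalty`), and candidate edge positions are supplied by the caller. All names are hypothetical.

```python
import bisect
import math


def figure_of_merit(h_pts, b_pts, inner_edges, penalty=0.1, pseudo=0.5):
    """Expected evidence for exactly known Poisson means, minus an
    assumed constant penalty per bin."""
    edges = sorted(inner_edges)
    n_bins = len(edges) + 1
    h = [pseudo] * n_bins
    b = [pseudo] * n_bins
    for x in h_pts:
        h[bisect.bisect_right(edges, x)] += 1
    for x in b_pts:
        b[bisect.bisect_right(edges, x)] += 1
    evidence = sum((hk - bk) * math.log(hk / bk) for hk, bk in zip(h, b))
    return evidence - penalty * n_bins


def greedy_bin_edges(h_pts, b_pts, candidates, **kwargs):
    """Repeatedly add the candidate edge that most increases the figure
    of merit; halt when no remaining candidate increases it."""
    edges = []
    best = figure_of_merit(h_pts, b_pts, edges, **kwargs)
    remaining = list(candidates)
    while remaining:
        trial = max(
            remaining,
            key=lambda c: figure_of_merit(h_pts, b_pts, edges + [c], **kwargs),
        )
        merit = figure_of_merit(h_pts, b_pts, edges + [trial], **kwargs)
        if merit <= best:
            break
        edges.append(trial)
        remaining.remove(trial)
        best = merit
    return sorted(edges), best


if __name__ == "__main__":
    h_points = [1, 2, 3, 4]
    b_points = [6, 7, 8, 9]
    print(greedy_bin_edges(h_points, b_points, candidates=[2.5, 5.0, 7.5]))
```

In the small usage example the two samples are cleanly separated at x = 5, so a single edge there captures all of the available evidence and further edges only add to the penalty.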

A second example is shown in Fig. 6. Figure 6(a)
shows the true (analytic and unknown) predictions h(x)
and b(x), both Gaussians with identical mean and area
but different widths. One thousand Monte Carlo points
pulled from each of h(x) and b(x) are shown in Fig. 6(b).
The use of a single bin from 0 to 50 results in a figure
of merit of $\mathcal{M} = -0.14$; a negative value is obtained because
the total number of events predicted by h and b
in this single encompassing bin is the same (the Gaussians
have equal area), and the penalty term $\mathcal{P}$ in Eq. (2)
drives the figure of merit $\mathcal{M}$ negative. In Figs. 6(c)-(f),
the units of the vertical axes are expected evidence, with
the predictions of h and b again superimposed. Notice
the difference in vertical scale between Figs. 5(c)-(f) and
Figs. 6(c)-(f); the evidence we expect the experiment to
provide in favor of h relative to b (or vice versa) is clearly
much larger in the example of Fig. 6.

A first bin edge is placed in Fig. 6(c), the figure of merit
$\mathcal{M}$ computed as the bin edge's position is scanned in x,
resulting in the dark (jagged, blue) curve. As expected
by looking at the true distributions for h and b, the algorithm
prefers bin placement at x ≈ 15 or x ≈ 35, where
the analytic predictions for h and b cross in Fig. 6(a). In
Fig. 6(c), placement of a bin edge at x = 32 is slightly
favored. The first bin edge is placed at this point, resulting
in one bin ranging from 0 to 32, and a second
bin covering 32 to 50. The process is repeated, with the
expected evidence curve shown in Fig. 6(d), and a second
bin edge is placed at x = 17, raising the total figure of
merit to $\mathcal{M} = 6.26$. Figures 6(e) and (f) show the process
repeated twice more, raising the total figure of merit
to $\mathcal{M} = 8.15$. The algorithm places eight additional bin
edges in the regions x ≈ 20 and x ≈ 30 before halting.

The algorithm's performance in these two cases is re- 
markably intuitive. In the first example, the procedure 
nicely carves out the region around the bump that h(x)
shows relative to b(x) in Fig. 5(a), correctly ignoring
the bulk of the distribution at x < 10 and the tail at
x > 30. In the second example, the algorithm systematically
works from side to side in Fig. 6, from the right
of the mean to the left of the mean and back, doggedly 
separating regions in which h predicts more events than 
b from regions in which b predicts more than h. 

The algorithm presented here has at least two multi- 
variate generalizations. One option iteratively places bin 
edges in the form of hyperplanes parallel to the variable 
axes, creating a grid in the multidimensional space. In 
some cases this may be an acceptable approach, but the 
resulting rectangular bins are too constrained in shape to 
adequately handle an arbitrary multidimensional prob- 
lem. An (improved) alternative is to use kernel density 
estimation to first reduce the problem to a single dimen- 
sion, enabling the application of the one-dimensional bin- 
ning algorithm just described. 

V. FEWKDE 

Standard kernel estimation involves placing bumps of 
probability, typically in the form of Gaussian kernels, 
around each Monte Carlo point. Summation of kernels 
placed around each of an ensemble of Monte Carlo points 
forms the density estimate. 

In this standard approach, the evaluation of the density
at any particular point requires the evaluation of a
Gaussian centered at each of the $N_{MC}$ Monte Carlo
points. The time cost of evaluating this density estimate
at each of the points used to generate the estimate thus
grows as $O(N_{MC}^2)$, which becomes prohibitive when dealing
with samples of > 10^ Monte Carlo points. Application
to high statistics Tevatron and future LHC analyses
is facilitated by noting that distributions derived from 
four-vector quantities of final state objects in high-pT 
collider physics can be approximated satisfactorily by the 
sum of just a few Gaussians. 

An algorithm called FewKDE has been introduced 
with the generally featureless nature of our distributions 
in mind, where "FewKDE" is shorthand for "kernel den- 
sity estimation with few kernels." The parameters of the 
few Gaussians are chosen to provide the best fit to the 
data. A novel technique is employed to appropriately 
handle the types of hard physical boundaries (such as
pT > 0) that exist in commonly considered distributions.
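
The difference in cost between the two approaches can be seen in a short sketch. The code below only evaluates the two kinds of estimate; the fit that chooses the few Gaussians and the treatment of hard boundaries are not specified in enough detail here to reproduce, so they are omitted. Function names and the `(weight, mean, width)` representation of a component are assumptions made for the example.

```python
import math


def gauss(x, mean, width):
    """Unit-area Gaussian evaluated at x."""
    z = (x - mean) / width
    return math.exp(-0.5 * z * z) / (width * math.sqrt(2.0 * math.pi))


def standard_kde(x, points, bandwidth):
    """One kernel per Monte Carlo point: cost O(N) per evaluation,
    hence O(N^2) to evaluate at all N points."""
    return sum(gauss(x, p, bandwidth) for p in points) / len(points)


def few_gaussians(x, components):
    """Mixture of a few Gaussians given as (weight, mean, width);
    cost O(K) per evaluation, with K fixed and small."""
    return sum(w * gauss(x, m, s) for w, m, s in components)


if __name__ == "__main__":
    sample = [18.0, 20.0, 21.0, 25.0]
    print(standard_kde(20.0, sample, bandwidth=2.0))
    print(few_gaussians(20.0, [(0.75, 20.0, 2.5), (0.25, 25.0, 2.0)]))
```

Once the mixture parameters are fixed, the cost of evaluating the estimate at every Monte Carlo point grows linearly with the sample size rather than quadratically.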

VI. SUMMARY 

These proceedings have briefly sketched a method allowing
the systematic analysis of high energy collider
data. After providing the experimental and theoretical
contexts of frontier energy collider data to the
statisticians, astrophysicists, and cosmologists in the audience,
a direct solution to a few of the problems we face
in the analysis of those data has been described.

Given the variety of possible forms physics beyond the 
standard model may take, the question of how to search 
for something when we know only vaguely what it is we 
are searching for becomes acute. Sleuth is an algo- 
rithm that accomplishes this in a rigorous and system- 
atic way, enabling a model-independent search for new 
high-pT physics. 

Once a hint of new physics is observed, data under- 
stood in the context of a systematic search must be in- 
terpreted in terms of the underlying physical theory. Ac- 
complishing this requires a procedure for quickly and ef- 
ficiently testing particular hypotheses against the data. 
Quaero provides a qualitatively new medium for facilitating
this interpretation.

In order to provide a feeling for one of several algorithmic
pieces introduced in the development of Quaero,
a procedure for optimally choosing a binning for the 
computation of a binned likelihood ratio has been de- 
scribed. Generalization to the multivariate case makes 
use of FewKDE, a time-saving variant of the standard 
procedure for kernel density estimation. 

It is our hope that the ideas presented here, developed 
for a particular problem within high energy physics, may 
lend themselves to many other problems in the physical 
sciences. 